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Abstract 

Discrete Markov random fields form a natural class of models to rep¬ 
resent images and spatial data sets. The use of such models is, how¬ 
ever, hampered by a computationally intractable normalising constant. 

This makes parameter estimation and a fully Bayesian treatment of dis¬ 
crete Markov random fields difficult. We apply approximation theory for 
pseudo-Boolean functions to binary Markov random fields and construct 
approximations and upper and lower bounds for the associated computa¬ 
tionally intractable normalising constant. As a by-product of this process 
we also get a partially ordered Markov model approximation of the bi¬ 
nary Markov random field. We present numerical examples with both the 
pairwise interaction Ising model and with higher-order interaction models, 
showing the quality of our approximations and bounds. We also present 
simulation examples and one real data example demonstrating how the 
approximations and bounds can be applied for parameter estimation and 
to handle a fully-Bayesian model computationally. 

Keywords: Approximate inference; Bayesian analysis; Discrete Markov ran¬ 
dom fields; Image analysis; Pseudo-Boolean functions; Spatial data; Variable 
elimination algorithm. 


1 Introduction 

In statistics in general and perhaps especially in spatial statistics we often find 
ourselves with distributions known only up to an unknown normalising constant. 
Calculating this normalising constant typically involves high dimensional sum¬ 
mation or integration. This is the case for the class of discrete Markov random 
fields (MRF). A common situation in spatial statistics is that we have some un¬ 
observed latent field x for which we have noisy observations y. We model x as 
an MRF with unknown parameters 9 for which we want to do inference of some 
kind. If we are Bayesians we could imagine adding a prior for our parameters 
9 and studying the posterior distribution p{9\y). A frequentist approach could 
involve finding a maximum likelihood estimator for our parameters. Without 
the normalising constant, these become non-trivial tasks. 
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There are a number of techniques that have been proposed to overcome this 
problem. The normalising constant can be estimated by running a Markov 
chain Monte Carlo (MCMC) algorithm, which can then be combined with 
various techniques to produ c e maximum likelihood es tima tes, see for instance 


Gever and Thompson! (j 19951 1 , iGelman and Mend (j 199811 and iGu and Zhul (|200lll . 


Ot her approaches take advantage of the fact that exact sampling can be done. 


see 


Moller et al. ( 200(tI ). In the present report however, we focus on the class of 


deterministic methods, where deterministic in this setting is referring to that re¬ 


peatin g the estimation process yields the same estimate. In iBeeves and Pettitt 


(j2004 l the authors devise a computationally efficient algorithm for handling so 
called general factorisable models of which MRFs are a common example. This 
algorithm, which we refer to from here on as the variable elimination algorithm, 
grants a large computational saving in calculating the normalising constant by 
exploiting the factorisable structure of the models. For MRFs defined on a lat¬ 
tice this allows for calculation of the normalising constant on la ttices with up 
to aro und 2 0 rows for models with hrst order neighbourhoods. In iFriel and Rue 
( 2007 1 and Friel et al. ( 2009l l the authors construct approximations for larger 
la ttices by doing computatio ns for a number of sub-lattices using the algorithm 


m 


Reeves and PettittI ( 20041 1. 


The energy function of a binary MRF is an example of a so called pseudo- 
Boolean function. In general, a pseudo-Boolean function is a function of the 
following type, / : {0,1}” —> M. A full representation of a pseudo-Boolean 
function requires 2"' terms. Finding approximate representations of pseudo- 
Boolean functions tha t requ ire f ewer coefficients is a wel l studied field, see 
Hammer and Holzman ( 19921 ) and Grabisch et al. ( 200fll ). In Hammer and R.udeanu 


( 1968[ l the authors show how any pse udo-Boolean function can be ex pressed as 
a binary polynomial in n variables. Tielmeland and AustadI ( 2012 1 expressed 
the energy function of MRFs in this manner and by dropping small terms during 
the variable elimination algorithm constructed an approximate MRF. 

Our approach and the main contribution of this paper is to apply and ex¬ 
tend approximation theory for pseudo-Boolean function to design an approxi¬ 
mate variable elimination algorithm. By approximating the binary polynomial 
representing the distribution before summing out each variable we get an algo¬ 
rithm less restricted by the dependence structure of the model, thus capable of 
handling MRFs defined on large lattices and MRFs with larger neighbourhood 
structures. For the MRF application this approximation defines an approxima¬ 
tion to the normalising constant, and as a by-product we also get a partially 
ordered Markov model (POMM) approximation to the MRF. For the POMM 
approximation we can calculate the normalising constant and evaluate the like¬ 
lihood, as well as generate realizations. We also discuss how to modify our 
approximation strategy to instead get upper and lower bounds for the normalis¬ 
ing constant, and how this in turn can be used to construct an interval in which 
the maximum likelihood estimate must lie. We also discuss another variant of 
the approach w hich produces an approximate version of the Viterbi algorithm 
( Kiinsch . 2001 1. 

The article has the following layout. In Section [2] we define pseudo-Boolean 
functions and give a number of approximation theorems for this function class. 
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Thereafter, in Section [3] we introduce binary MRFs and the variable elimination 
algorithm, and in Section 0] we apply the approximation theorems for pseudo- 
Boolean functions to define our approximative variable elimination algorithm 
for binary MRFs. In Section [5] we define a modified variant of the approximate 
variable elimination algorithm through which we obtain upper and lower bounds 
for the normalising constant of a binary MRF, and we discuss how to modify 
the approximation algorithm to obtain an approximate version of the Viterbi 
algorithm. We briefly discuss some implementational issues in Section [6l and 
in Section [7] we present simulation and data examples. Finally, in Section [8] we 
provide closing remarks. 


2 Pseudo-Boolean functions 


In this section we introduce the class of pseudo-Boolean functions and discuss 
various aspects of approximating pse udo- Boolean functions part ly based on re¬ 
sults of lHammer and Holzmanl (1 19921 ) and iGrabisch et al.l (12001)1 ). 


2.1 Definition and notation 


Let X = {xi,... ,Xn) e ft = {0,1}"' be a vector of binary variables and let 
N = {1,... , n} be the corresponding list of indices. Then for any subset A N 
we associate an incidence vector x of length n whose kth element is 1 if A; e A 
and 0 otherwise. We refer to an element of x, x^, as being "on" if it has value 1 
and "off" if it is 0. A pseudo-Boolean function /, of dimension dim(/) = n, is a 
fu nction that associates a r eal va lue to each vector, x e {0,1}", i.e / : {0,1}" ^ 
M. Hammer and R.udeanu ( 19681 ) showed that any pseudo-Boolean function can 


be expressed uniquely as a binary polynomial, 


fix) = 2 


( 1 ) 


Aeiv 


fcsA 


where (5^ are real coefficients which we refer to as interactions. We define the 
degree of /, deg(/) as the degree of the polynomial. In general the representation 
of a function in this manner requires 2" coefficients. In some cases one or more 
might be zero and in this case a reduced representation of the pseudo-Boolean 
function can be defined by excluding some or all the terms in the sum in ([TJ 
where ( 5 ^ = 0 . Thus we get, 

fix) = Yi ( 2 ) 

AeS keA 

where S' is a set of subsets of N at least containing all A Q A^ for which /3^ ^ 
0. We then say that fix) is represented on S. Moreover, we say that our 
representation of / is dense if for all AeS, all subsets of A are also included in 
S. The minimal dense representation of / is thereby (j2]) with, 

S = {A c A^ ; ^ 0 for some A 2 A}. (3) 


Throughout this report we restrict the attention to dense representations of 
pseudo-Boolean functions. 
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We also need notation for some subsets of S and n. For A e S' we define 
Sa = {A e S : A c A}, the set of all interactions that include A. For example if 
n = 3 and S = {0, {!}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1,2,3}}, we have S{i, 2 } = 
{{1, 2}, {1,2, 3}}. Equivalently for the set ft, for A e S we define = {x e Q : 
Xk = 1, V/c e A}, the set of all states x where Xk are on for all k e X. For the 
same S as above we have for instance Al{i, 2 } = {(1) IjO)! (Ij 1) !)}■ We will use 
also the complements of these two subsets, S^ = S\Sx and = ^}\Qx- Lastly 
we define = {A e S : X n A = 0} and = {x 6 n : = 0, Vfc e A}. If 

we think of the sets Sx and Oa as the sets where A is on, then S^ and are 
the sets where A is off. Again, using the same example S as above we have for 
instance 2 } ^ {0,3} and 2 } = {{0, 0,0}, {0,0,1}}. Note that in general 
and equivalently ^ 

2.2 Approximating pseudo-Boolean functions 

For a general pseudo-Boolean function, the number of possible interactions in 
our representation grows exponentially with the dimension re. It is therefore 
natural to ask if we can find an approximate representation which requires less 
memory. We could choose some set S S to define our approximation, thus 
choosing which interactions to retain, S, and which to remove, S\S. For a given 
S our interest lies in the best such approximation according to some criteria. We 
define Ag{f{x)} = /(x) = operator which returns the 

approximation that, for some given approximation set S, minimises the error 
sum of squares (SSE), 


SSE(/,/) = 2 {/(0-/»}'• (4) 

xeQ 

We find the best approximation by taking partial derivatives with respect to 
for all A e S' and setting these expressions equal to zero. This gives us a system 
of linear equations, 


2 /(^) 


L 


fcsA / x£Q\ 


(5) 


yAeS 


Existence and uniqueness of a solution is assured since we have |S| unknown 
variables and the same number of linearly independent equations. If S 2 S the 
best approximation is clearly the function itself, /(x) = /(x). 

It is common practice in statistics and approximation theory in general to 
approximate higher order terms by lower order terms. A natural way to design 
an approximation would be to let S include all interactio ns of degree less than 
or equal to some value k. In Hammer and Holzman ( 1992f l the authors focus on 
approximations of this type and proceed to show how the resulting system of 
linear equations through clever reorganisation can be transformed into a lower 
triangular system. They sol ve this for k = 1 and k = 2 as well as proving a 
number of useful properties. Grabisch et al. ( 200Clh solve this for general k. 
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In the present article we consider the situation where 5 is a dense subset 
of S. So if A e S, then all A c A must also be included in S. Clearly the 
approximation using all interactions up to degree A: is a special case of our class 
of approximations. Our motivation for studying this particular design of S will 
become clear as we study the variable elimination algorithm in Section [3l We 
now give some useful properties of thi s approximation. The two first theorems 
are from Hammer and Holzman ( 19921 ). 


Theorem 1. The above approximation Ag{f{x)} is a linear operator, i.e. for 
any constants a,b eM and pseudo-Boolean functions g{x) and h{x) represented 
on S, we have that A^{ag{x) + bh{x)} = aAg{g{x)} + bA^{h{x)}. 


Theorem 2. Assume we have two approximations of f{x), Ag{f{x)} and A:{f{x)}, 
such that S S S. Then A^[Ag{f{x)}] = Az[f(^x)}. 


Proofs can be found in Hammer and HolzmarJ (1992) and in Austad ( 2011 '). 
Since each interaction term in a pseudo-Boolean function is a pseudo-Boolean 
function in itself, Theorem [T] is important because it means that we can ap¬ 
proximate a pseudo-Boolean function by approximating each of the interaction 
terms involved in the function individually. Also, since the best approximation 
of a pseudo-Boolean function is itself, we only need to worry about how to ap¬ 
proximate the interaction terms we want to remove. Theorem [2] shows that a 
sequential scheme for calculating the approximation is possible. 

Next we give two theorems characterising the properties of the error intro¬ 
duced by the approximation. 


Theorem 3. Assume again that we have two approximations of f{x), A^{f{x)} 
and A^{f{x)}, such that S S S. Letting f{x) = Ag{f{x)} and f{x) = 
A--{f{x)}, we then have SSE{fJ) = SSE{fJ) + SSE{fJ~). 

Theorem 4. Given a pseudo-Boolean function f{x) and an approximation f{x) 
constructed as described, the error sum of squares can be written as. 


2 {/(^) “ /(^)} 


E T 2 l/(i) - /(i)l 

keS\S L 


( 6 ) 


Proofs of Theorems [3]and|4] are in Appendices lAl and IbI respectively. Note 
that Theorem |4] tells us that the error can be expressed as a sum over the /3’s 
that we remove when constructing our approximation. Note also the special 
case where S = S\\ for some A e S', i.e we remove only one interaction 15^. 
Then, 


2 {/(^) “ /(®)} 

xeQ 




2 fi^)} 

_XGftx 


(7) 


With these theorems in hand we can go from S to S by removing all nodes in 
S\S. Theorems [Hand[2] allow us to remove these interactions sequentially one at 
a time. We start by removing the interaction (or one of, in the case of several) 
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(5^,X e S\S with highest degree and approximate it by the set containing all 
A c A. In other words, if the interaction has de gree fc = |A| we d esign the k — 1 


order approximation of that interaction term, 
the expression for this, 


Grabisch et ah ( 200d 'l gives ns 


/?- 


1 


if VA c A, 
otherwise. 


( 8 ) 


We then proceed by removing one interaction at a time until we reach the set of 
interest, S. The approximation error in one step of this procedure is given by 
([7|) and the total approximation error is given as the sum of the errors in each 
of the approximation steps. 


2.3 Second order interaction removal 


In this section we discuss pseudo-Boolean function approximation for a specific 
choice of S, which is of particular interest for the variable elimination algorithm. 
We show how we can construct a new way of solving the resulting system of 
equations and term this approximation the second order interaction removal 
(SOIR) approximation. 

For i, j e N, i j and {i,j} £ S, assume we have S = In other words 

we want to remove all interactions involving both i and j and approximate 
these by lower order interactions. To find this approximation we could of course 
proceed as in the previous section, sequentially removing one interaction at the 
time until we reach our desired approximation. However, the following theorem 
gives explicit expressions for both the approximation and the associated error. 

Theorem 5. Assume we have a pseudo-Boolean function f{x) represented on a 
dense set S. For i,jeN, i ^ j and {i, j} £ S, the least squares approximation 
of f{x) on S = is given by 


fix) = Ag{f{x)} = 2 n 

AeS keA 

where for A e S is given as 

2/A u {i,j] £ S', 
z/ A u {z} £ S and A u {j} ^ S, 
if Ayj {i} ^ S and A u {j} £ S, 
otherwise. 

The associated approximation error is 


r = 


l/^AuM 

?A _L 

/3^ 




(9) 


( 10 ) 


fix) - fix) = XiXj + 


1 1 
- Xi 

4 2 


1 

-Xn 
2 ^ 


E 


Ae5, 




/ 3 ^ n 

keA\{i,j} 


( 11 ) 


A proof is given in Appendix O Clearly, the approximation solution of this 
theorem corresponds to the solution we would get using the sequential scheme 
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discussed above, but (|10p is much faster to calculate and the above theorem has 
the advantage of giving us a nice explicit expression for the error. Note that the 
absolute value of the parenthesis outside the sum in (HID is always | and thus 
the absolute value of f{x) — f{x) does not depend on Xi or xj. We thereby get 
the following expression for the error sum of squares, 

SSE{f,f) = Y,{m-f{x)f = \ 2 ( s n ■ (12) 

xeQ xsrijj j} \ AeS{ij} keA\{i,j} j 


2.4 Upper and lower bounds for pseudo-Boolean functions 

In this section we construct upper and lower bounds for pseudo-Boolean func¬ 
tions. We denote the upper and lower bounds by fu{x) and fiix), respectively, 
i.e. we require /l(x) ^ f{x) ^ fu[x) for all x e fl. Just like for the SOIR 
approximation /(x), we require that all interactions involving both i and j are 
removed from fu{x) and fiix)- Using the expression for the SOIR approxima¬ 
tion error in m we have 


/(x) = /(x) -h 



(13) 


The terms that are constant or linear in Xj and Xj can be kept unchanged in 
fu{x) and /l(x), whereas we need to find bounds for the interaction part 


g{x) = XiXj ^ 1/3^ Yl I • (14) 

\ k£A\{i,j] J 

As Xj e {0,1}, an upper bound for g(x) which is linear in Xj and constant as a 
function of xj is 

g(x) ^ Xi max i 0, ^ i I3^ j I . (15) 

t AsSfij} y keA\{i,j} J J 

An upper bound which is linear Xj and constant as a function of x* is correspond¬ 
ingly found by using that x* e {0,1}. Moreover, any convex linear combination 
of these two bounds is also a valid upper bound for g{x). Similar reasoning 
for a lower bound produces the same type of expressions, except that the max 
operators are replaced by min operators. 

The bounds defined above are clearly valid upper and lower bounds, and by 
construction they have no interactions involving both i and j. However, in the 
next section we need to find the canonical forms, given by ([2]), of the bounds 
and then the computational complexity of constructing these representations is 
important. Focusing on the bound in (I15D . the max function is a function of 
dij = |{A e S'pj} : |A| = 3} variables, so we need to compute the values of 
2U interaction coefficients. This is computationally feasible only if dij is small 
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enough. If dij is too large we need to consider computationally cheaper, and 
coarser, bounds. To see how this can be done, first note that for any r e N\{i,j} 
and {i,j,r} e Spwe have •S’p j j.j c 5pj} and thereby the g(x) defined in (fTT)l 
can alternatively be expressed as 


g(x) = XiXjXr 2 j /3''' Yl 






E n 


AsS, 




{i,j ,t} 


ke^\{i,j}i 


(16) 

An alternative upper bound can then be defined by following a similar strategy 
as above, but for each of the two terms in (I16p separately. This gives the upper 
bound 


fix) ^ Yi 1 1j (n 

AeS{i,j,r} \ keA\{i,j,r} 


AeS' 


{id} 


keA 


(17) 


+Xi max -( 0, s n Xk 

j y /cgA\{2,j} 

and the corresponding lower bound is again given by the same type of expression 
except that the max operators are replaced by min operators. One should note 
the canonical form of the bound in (HZl) can be done for each max term sepa¬ 
rately. Moreover, as both max terms in (I17p are functions of strictly less than 
dij variables the computational complexity of finding the canonical representa¬ 
tion of the bound in (fTTp is smaller than the complexity of the corresponding 
operation for the bound in (IlSp . However, for one or both of the max functions 
in (ini), the task of transforming it into the canonical form may still be too 
computationally expensive. If so, the process of splitting a sum into a sum of 
two sums must be repeated. For example, if the first max function is problem¬ 
atic, one needs to locate an s e N\{i,j,r} so that {i,j,r,s} e and the 

sum over A e S^ij ^.] can be split into a sum over A e Spj,, and a sum over 
A e S^ij^j.}\S{ij,r,s} finding bounds as before. By repeating this process 
sufficiently many times one will eventually end up with a bound consisting of a 
sum of max terms that can be transformed to the canonical form in a reasonable 
computation time. 


3 MRFs and the variable elimination algorithm 

In this section we give a short introduction to binary MRFs. In particular we 
explain how the variable elimination algorithm can be applied to this class of 
models ar id poin t out i ts c omputa t ional limitation. For a general introduction to 
MRFs see Besae ( 1974 ) or Cressie ( 1993h and for more on the connection betwee n 
binary MRFs and pseudo-Boolean functions see Tielmeland and Austad ( 20121 ). 
For more on the variable elim ination algorithm ari d applications to MRFs see 
Reeves and Pettitt ( 2004 1 and Friel and Rue (2007). 
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3.1 Binary Markov random fields 

Assume we have a vector of n binary variables x = e = {0,1}", 

N = {l,...,n}. Let Af = {A/i,... ,Afn} denote a neighbourhood system where 
A4 denotes the set of indices of nodes that are neighbours of node k. As usual 
we require a symmetrical neighbourhood system, so if i 6 Afj then j e Mi, and 
by convention a node is not a neighbour of itself. Then x is a binary MRF with 
respect to a neighbourhood system M if p{x) > 0 for all x e 11 and the full 
conditionals p(xfc|x_fc) have the Markov property, 


p{xk\x_k) = p{xk\xMk) V X e H and ke N, (18) 


where xa/}. = {xi : i e Mk)- A clique A is a set K N such that for all pairs 
i,j e A we have i e Mj. A clique is a maximal clique if it is not a subset of 
another clique. T he set o f all nraximal cl i ques w e denote by C. The Hammersley- 
Clifford theorem (Besag, 1974 : Clifford . 199d l tells us that we can express the 
distribution of x either through the full conditionals in (1181) or through clique 
potential functions, 


p{x) = - exp{[/(x)} = - exp i V Ua{xa) i 

^ ^ IasC J 


(19) 


where c is a normalising constant, U\{x\) is a potential function for a clique A 
and XA = (xj : i e A). U{x) is commonly referred to as the energy function. 
From the previous sections we know that U{x) is a pseudo-Boolean function 
and can be expressed as, 


u{x)= 2 n =s n 

AcAT fcsA AsS fcsA 


( 20 ) 


where S is defined as in ([3|) . For a given energy function U (x), iTielmeland and Austad 
( 20121 ) show how the interactions can be calc ulated recursively by evaluating 
U{x). Moreover, ITielmeland and AustadI ( 2012l l show that = 0 whenever A 
is not a clique. From this we understand that it is important that we represent 
U{x) on S as defined in ([3]), and not use the full representation. 


3.2 The variable elimination algorithm 

As always the problem when evaluating the likelihood or generating samples 
from MRFs is that c is a function of the model parameters and in general 
unknown. Calculation involves a sum over 2" terms. 


c = ^ exp {U (x)} = ^ exp ( XI 11 

ieO xsQ, \AeS fcsA / 


( 21 ) 


The variable elimination algorithm (iR.eeves and Pettittl . 120041 : iFriel and Rue . 
20071 1 calculates the sum in (|2ip by taking advantage of the fact that we can 
calculate this sum more efficiently by factorising the un-normalised distribution. 
We now cover this recursive procedure. 
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Clearly we can always split the set S into two parts, and where 
i e N. Thus we can split the energy function in (|20p into a sum of two sums, 

U{x)= ^ /3'^]^XA:+ Yi 

keA AE5{ij fcsA 

Note that the first sum contains no interaction terms involving Xj. Letting 
X-i = (xi,..., Xj_i, Xj+i,..., Xn), we note that this is essentially equivalent to 
factorising p{x) = p(xj|x_j) p(x_j), since 


p(xj|x_j)oc exp 


Xk • 


(23) 


, AsS{i} ksA 


By summing out x* from p(x) we get the distribution of p(x_j). Taking advan¬ 
tage of the split in (|22p we can write this as. 




,A£S?., keA 
in 


I AsSji} keA 


The last sum over Xj can be expressed as the exponential of a new binary 
polynomial, i.e. 


exp ( Y 

VacA/I keA / 




(25) 


,AeS. 


{i} 


keA 


where the interactions can be sequentially calculated by ev aluating the sum 
over Xj in ()25p as described in iTielmeland and AustadI (120121 ). Note that this 
new function is a pseudo-Boolean function potentially of full degree. The num¬ 
ber of non-zero interactions in this representation could be up to 2l-^*L Summing 
out Xj leaves us with a new MRF with a new neighbourhood system. This is the 
hrst step in a sequential procedure for calculating the normalising constant c. 
In each step we sum over one of the remaining variables by splitting the energy 
function as above. Repeating this procedure until we have summed out all the 
variables naturally yields the normalising constant. 

The computational bottleneck for this algorithm occurs when representing 
the sum in (I25p . Assume we have summed out variables xi:i_i = (xi,..., Xj_i), 
have an MRF with a neighbourhood system M = {Mi,... and want to 

sum out Xj. If Mi is too large we run into trouble with the sum corresponding 
to ()25p since this requires us to compute and store up to 2-™* interaction terms. 
In models where Mi increases as we sum out variables the exponential growth 
causes us to run into problems very quickly. As a practical example of this 
consider the Ising model defined on a lattice. Assuming we sum out variables in 
the lexicographical order, the size of the neighbourhood will grow to the number 
of rows in our lattice. This thus restricts the number of rows in the lattice to 
less or equal to 20 for practical purposes. 
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4 Construction of an approximate variable elimina¬ 
tion algorithm 

In this section we include the approximation results of Section [2] in the variable 
elimination algorithm described in the previous section to obtain an approx¬ 
imate, but computationally more efficient variant of the variable elimination 
algorithm. To create an algorithm that is computationally viable we must seek 
to control |A/i| = r]i as we sum out variables. If this neighbourhood becomes 
too large, we run into problems both with memory and computation time. Our 
idea is to construct an approximate representation of the MRF before summing 
out each variable. The approximation is chosen so that r]i ^ u, where v is an 
input to our algorithm. Given a design for the approximation we then want to 
minimise the error sum of squares of our energy function. 

Assume we have an MRF and have (approximately) summed out variables 
xi-i-i, so we currently have an MRF with a neighbourhood structure M and 
energy function t7(xj:„) = ^^IlkeA^k, so, 

c = ^ exp {[/(xj:„)} . (26) 

^i:n 

If r]i is too large we run into problems when summing over Xj. Our strategy 
for overcoming this problem is first to create an approximation of the energy 
function tj{Xi:n), 

u{Xi:n) = 2 Xfc ^ t/ {Xi-n) = XI 0 

AeS keA AeS keA 

We control the size of rji, by designing our approximation set S and thus the 
new approximate neighbourhood J\f in such a way that \J\fi\ = fji ^ v. Assuming 
we can do this, we could construct an approximate variable elimination algo¬ 
rithm where we check the size of the neighbourhood rji before summing out each 
variable. If this is greater than the given v we approximate the energy function 
before summation. This leaves two questions; how do we choose the set S and 
how do we define the approximation? The two questions are obviously linked, 
however we start by looking more closely at how we may choose the set S. Our 
tactic is to reduce r/j by one at the time. To do this we need to design S in such 
a way that i and some node j are no longer neighbours. Doing this is equivalent 
to requiring all interactions /3^, involving both i and j to be zero. As before 
we denote the subset of all interactions involving i and j as ^ S and 

construct our approximation set as in Section [2.31 defining S = Our 

approximation is defined by the equations corresponding to ([5]) and using the 
results from Section 12.31 the solution is easily available. We can then imagine 
a scheme where we reduce r]i one at a time until we reach our desired size v. 
This leaves the question of how to choose j. One could calculate the SSE for all 
possibilities of j and choose the value of j that has the minimum SSE. However, 
this may be computationally very expensive and would in many cases dominate 
the total computation time of our algorithm. Instead we propose to compute an 
approximate upper bound for \f{x) — f{x)\ for all values of j and to select the 


11 


value of j that minimises this approximate bound. We dehne the approximate 
bound by hrst defining a modihed version f*{x) of /(x), where we set all first, 
second and third order interactions for f*{x) equal to corresponding quantities 
for f{x), and set all fourth and higher order interactions for f*{x) equal to zero. 
We then define the approximate upper bound as the exact upper bound for 
|/*(x) — f*{x)\. As f*{x) has no non-zero fourth or higher order interactions 
upper and lower bounds for f*{x) — f*{x) can readily be computed as discussed 
in Section 12.41 

One should note that Theorem [2] means that after reducing iji by iji — v our 
approximation is still optimal for the given selection of j’s. However, there is 
no guarantee that our selection of j’s is optimal, and it is possible that we could 
have obtained a better set of j’s by looking at the error from reducing T\i by 
more than one at the time. 

Using this approximate variable elimination algorithm we can define a corre¬ 
sponding approximate model through a product of the approximate conditional 
distributions, 

p(x) = p(xi|x2:n) ■ ■ ■ p(Xn-l\Xn)p(Xn), (28) 


which is a POMM ( Cressie and DavidsorJ . 1998h . One of the aspects we wish to 
investigate in the results section is to what extent this distribution can mimic 
some of the attributes of the original MRF. Clearly, to sample from p{x) is easy 
via a backward pass, first simulating Xn from p{xn), thereafter simulating Xn-i 
from p(xn-i\xn) and so on. One should note that two versions (|28)l can be 
dehned. The hrst version is obtained by taking p{xi\xi+i:n) to be the resulting 
conditional distribution after we have (approximately) summed out xi-i-i. In 
p{xi\xi+i-n) we then have no guarantee for how many of the elements in Xi+i-,n 
the variable Xi really depends on, and it may be computationally expensive to 
compute the normalising constant of this conditional distribution for all values 
of Xj+i:n- The alternative is to let p{xi\xi+i-n) be the resulting conditional dis¬ 
tribution after one has both (approximately) summed out xi-,i-i and done the 
necessary approximations so that Xi is linked with at most v of the elements in 
Xi+i-n- Then we know that Xi is linked to at most v of the variables in Xi+i-,n 
in the conditional distribution also, and we have an upper limit for the compu¬ 
tational complexity of computing the normalising constant in the conditional 
distributions. Which of the two versions of p{x) one should use depends on what 
one intends to use p{x) for. One would expect the first version to be the best 
approximation of p{x), but for some applications it may be computationally 
infeasible. We discuss this issue further in the examples in Section [71 


5 Bounds and alternative marginalisation operations 

In this section we consider some variations of the approximate algorithm defined 
above. We hst discuss how the results in Section [2.41 can be used to modify the 
procedure to get upper and lower bounds for the normalising constant. There¬ 
after we consider how approximations (or bounds) for some other quantities can 
be found by replacing the summation operation in the above algorithm with 
alternative marginalisation operations. 
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5.1 Bounds for the normalising constant 

The approximate value for the normalising constant, c, found by the algorithm 
in Section Incomes without any measure of precision. Using the results of Section 
I2.4l we can modify the algorithm described above and instead find upper or lower 
bounds for c. 

Our point of origin for finding a bound cl (or cu) such that cl ^ c (or 
cjj ^ c) is the approximate variable elimination algorithm described in the 
previous section. An iteration of this algorithm consists of two steps. First 
the energy function is replaced by an approximate energy function and, second, 
we sum over the chosen variable. To construct an upper or lower bound we 
simply change the first step. Instead of replacing the energy function by an 
approximation we replace it with a lower (or upper) bound. To dehne such 
a bound we adopt the strategy discussed in Section 12.41 Letting Xi denote the 
next variable to sum over, we first have to decide which second order interaction 
to remove, i.e. the value of j in Section [T4l For this we follow the same strategy 
as for the approximate variable elimination algorithm discussed above. Then we 
use the bound in m whenever dij ^ z/, where dij is as dehned in Section 12.31 
and v is the same input parameter to the algorithm as in Section U] If dij > v 
we use a coarser bound as discussed in Section |4] In the definition of these 
coarser bounds it remains to specify how to choose the value of r in (I17|) . and if 
necessary also the value of s and so forth. For dehniteness we here describe how 
we select the value of r, but follow the same strategy for all such choices. For 
the choice of r we adopt a similar strategy as for j in Section |4l but now include 
all interactions up to order four in the dehnition of f*{x). When selecting a 
value for s we include in f*{x) interactions up to order hve and so fourth. 


5.2 Alternative marginalisation operations 

The exact variable elimination algorithm hnds the normalising c onstant of an 


MRF by summing over each variable in turn. As also discussed in ICowell et al. 


([200?1) for the junction tree algorithm, other quantities of interest can be found 
by replacing the summation operation by alternative marginalisation operations. 
Two quantities of particular interest in our setting is to find the state x which 
maximises U (x), and to compute moments of x. In the following we first consider 
the maximisation problem and thereafter the computation of moments 


5.2.1 Maximisation 

By replacing the summation over Xi in the exact variable elimination algo¬ 
rithm with a maximisation over Xj, the algorithm returns the maximal value of 
exp{C/(x)} over x e Q. By a following backward scan it is also possible to find 
the value of x which maximises exp{17(x)}. The forward and backward passes 
are together known as the Viterbi (1967) algorithm. Just as summation over 
Xi becomes computationally infeasible if the number of neighbours to node i 
is too large, the maximisation over Xi is also computationally infeasible in this 
situation. 
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To see how to construct an approximate Viterbi algorithm, recall that our 
approximate variable elimination algorithm consists of two steps in each itera¬ 
tion. First to approximate the energy function and then to sum over a variable. 
To get an approximate Viterbi algorithm we simply replace the second step, so 
instead of summing over a variable we take the maximum over that variable. 
Note that a lower or upper bound for exp{17(x)} can be found by replacing the 
approximation with a lower or upper bound as discussed in Section 15.11 

5.2.2 Moments 

Consider the problem of computing a moment E{i/'(a;)}, where x is distributed 
according to an MRF p(x), and ijj^x) is a given function of x 6 fl. Inserting the 
expression for p{x) in (I19p we get 

E{V’(x)} = - 2 exp{C/(x) -I- ln'0(x)}. (29) 

xgQ 

Thereby E{'!/)(x)} can be found by running the exact variable elimination algo¬ 
rithm twice, first as described in Section 13.21 to find c and thereafter with the 
energy function redefined as U{x) + ln'0(x) to hnd the sum in (I29p . In gen¬ 
eral the computational complexity of the second run of the variable elimination 
algorithm is much higher than for the first run, but if the pseudo-Boolean func¬ 
tion '4>{x) can be represented on the same set S as the energy function U{x) 
both runs are of the same complexity. We obtain an approximation to Fl{^jJ{x)} 
simply by adopting the approximate variable elimination algorithm instead of 
the exact one. To obtain a lower (upper) bound for Fl{^l){x)} we can divide a 
lower (upper) bound for the sum in (1291) with an upper (lower) bound for c. 

6 Some implementational issues 

When implementing the exact and approximate algorithms discussed above 
we need to use one (or more) data structure(s) for storing our representation 
of pseudo-Boolean functions. The operations we need to perform on pseudo- 
Boolean functions fall into two categories. The first is to compute and store all 
interaction parameters of a given pseudo-Boolean function without any (known) 
Markov structure. The computation of the interaction parameters /3^ defined 
by (I25p is of this form, and so is the corresponding operation for the max terms 
in m and (HZP. When doing this type of operations we are simply num¬ 
bering all the interaction parameters in some order and storing their values 
in a vector. The values are then fast to assess and the necessary computa¬ 
tions can be done efficiently. The second type of operation we need to do 
on a pseudo-Boolean function consists of operations on functions defined as 
in ([2|), functions for which a lot of the interaction parameters are zero. The 
approximation operation defined in Theorem [5] is of this type. We then need 
to adopt a data structure which stores an interaction parameter /3^ only if 
A e S. We use a directed acyclic graph (DAG) for this, where we have one 
node for each A e S and a node A e S is a child of another node A e S' if and 


14 



Figure 1: The directed acyclic graph used to represent a pseudo-Boolean func¬ 
tion represented on 5 = {0, {!}, {2}, {3}, {4}, {1, 2}, {1, 3}, {2,3}, {1, 2, 3}}. 


only if A = A u {z} for some i e c A. An illustration for N = {1,2,3,4} 
and S = {0,{1}, {2}, {3}, {4}, {1,2}, {1,3}, {2,3}, {1,2, 3}} is shown in Figure 
[T] The value of is stored in the node A, and the arrows in the figure are 
represented as pointers. To assess the value of an interaction parameter (3^ in 
this data structure we need to follow the pointers from the root to node A. This 
is clearly less efficient than in the vector representation discussed above, but 
this is the cost one has to pay to reduce the memory requirements. One should 
also note that such a DAG representation is very convenient when computing 
the approximation defined by Theorem [5j What one needs to do is first to clip 
out the subgraph which corresponds to <S'pj}. Thereafter one should traverse 
that subgraph and for each node in the subgraph add the required quantity to 
three interaction parameters in the remaining DAG, as specified by (llOp . 


7 Simulation and data examples 

In this section we first present the results of a number of simulation exercises to 
evaluate the quality of the approximations and bounds. Thereafter we present 
some simulation examples to demonstrate possible applications of the approxi¬ 
mations and bounds we have introduced. Finally, we use our approximation in 
the evaluation of a data set of cancer mortality from the United States. In all 
the examples we adopt the approximation and the bounds defined in Sections [4] 
and [5j 


7.1 Models 


In the simulation examples we consider tw o classes of MRFs. The first class 
we consider is the Ising model ( Besae . 1986l l. The energy function can then be 
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(b) 


Figure 2: (a) The third-order neighbourhood structure used in the higher-order 
interaction MRF. The white nodes are neighbour to the black node, (b) The 
corresponding two types types of maximal cliques. 
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Model 1 

0.5 

0.0 

0.0 

-1.0 

0.0 

-1.5 

0.0 

0.0 

-0.5 

-0.5 

Model 2 

0.75 

0.0 

0.0 

-1.5 

0.0 

-2.0 

0.0 

0.0 

-1.0 

-1.0 


Figure 3: Potential values for the various clique configurations in the higher- 
order MRF models. The potentials are invariant under rotation, reflection and 
inversion of the colours. 


expressed as 


U{x) = eYjI{xi = Xj), 


(30) 


where the sum is over all first order neighbourhood pairs, 0 is a model parameter, 
and I{xi = Xj) is the indicator function and takes value 1 \i Xi = Xj and 0 
otherwise. We present results for 9 = 0.4, 0.6, 0.8 and — ln(V2 — 1), where the 
last value is the critical value in the infinite lattice case. Representing U{x) as a 
binary polynomial is done by recursively calculating the f irst and second order 
interactions, for details see lTielmeland and AustadI (|2012h . 

In the second class of MRFs we assume a third order neighbourhood. Then 
each node sufficiently far away from the borders has 12 neighbours and there 
exists two types of maximal cliques, see Figure [2j In contrast to the Ising 
model this model includes interactions of higher order than two, and we denote 
it the higher-order interaction MRF. For the two types of maximal cliques we 
adopt potential functions that are invariant under rotation, reflection and when 
interchanging 0 and 1. With these restrictions the potential functions for the 
2x2 and five node cliques can take four and six values, respectively. We define 
two models of this type and the potentials for the various clique configurations 
are given in Figure [3l A realization from each of the models, generated by Gibbs 
sampling, is shown in Figured 


7.2 Empirical evaluation 

In this section we first consider the quality of the approximation of the normal¬ 
ising constant c and the corresponding bounds. Thereafter we evaluate to what 


16 


























Model 1 


Model 2 


Figure 4: Realizations from the two higher-order MRF models on a 100 x 100 
lattice, generated by Gibbs sampling. Potential functions for the models are 
defined in Figure [3j 




Figure 5: Results for the Ising model: The upper row shows the approximation 
ln(c) (solid) and the corresponding upper and lower bounds ln(cf/) and In(cL) 
(dashed) for v = 8,..., 18, and the lower row shows ln(c( 7 ) — In(cL) for v = 1 
to 18. 

extent our p{x) defined in Section |4] can be used as an approximation to the 
corresponding p{x). 

We first compute the approximate normalising constant c and the corre¬ 
sponding upper and lower bounds cjj and cl for v = 1, 2,... , 18 for each of our 
four 9 values. The results are presented in Figured The upper row shows the 
approximation ln(c) together with the bounds In(ci) and ln(c; 7 ) for i/ = 8 to 
19, whereas the lower row shows ln(cf/) — In(cL) for v = 1 to 19. For a given 
value of i', computation of an approximation takes about the same time for all 
values of 6, and computation of a bound takes about the same time as evaluat¬ 
ing the corresponding approximation. The computation time for one evaluation 
as a function of v is shown in Figure [6Ka). Figure [7] shows similar results for 
the higher-order MRF models, again for a 100 x 100 lattice, and correspond¬ 
ing computation times are shown in Figure |6Kb). Not surprisingly, we see that 
the quality of the approximation and bounds are best for models with weak 
interactions. 
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Figure 6: The natural logarithm of the computation times (in seconds) used 
to compute an approximation or a bound in (a) the Ising model, and (b) the 
higher-order interaction MRF. 
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Figure 7: Results for the higher-order MRF model: The upper row shows the ap¬ 
proximation ln(c) (solid) and the corresponding upper and lower bounds ln(c( 7 ) 
and ln(ci;,) (dashed) for z/ = 6,..., 18, and the lower row shows ln(c( 7 ) — In(ci) 
for = 1 to 18. 


Next we evaluate the quality of the POMM approximation p{x) given in 
(j28|l . still on a 100 x 100 lattice. To do this we consider an independent proposal 
Metropolis-Hastings algorithm where the MRF p{x) is the target distribution 
and p{x) is used as proposal distribution. We use the acceptance rate in such an 
algorithm to measure the quality of the approximation. It should be emphasised 
that we do not propose this Metropolis-Hastings as a way to sample fromp(3:), 
we just use the acceptance rate of this algorithm to measure the quality of our 
approximation. It should be noted that in this evaluation test we do not need 
to compute the normalising constant of the conditional distributions in (I28p for 
all values of the conditioning variables, so we apply the first (and best) POMM 
approximation variant discussed in the paragraph following (I28p . 

To estimate the quantity we generate 1000 independent samples from p{x) 
and corresponding 1000 independent samples fromp(x), compute the Metropolis- 
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Figure 8: Estimated acceptance rates, for u = 1,..., 18, for an independent pro¬ 
posal Metropolis-Hastings algorithm with target distribution given by the MRF 
p{x) on a 100 x 100 lattice and proposal distribution given by the corresponding 
POMM approximation p{x) in (j28|) . (a) Results for the Ising model, from top 
to bottom the curves are for 6 = 0.4, 0 = 0.6, 0 = 0.8 and 0 = — ln(v^— !)• (b) 
Results for the higher-order MRFs, the upper and lower curves are for Model 1 
and Model 2, respectively, defined in Figures [2] and [3l 


Hastings acceptance probability for each pair and use the average of these num¬ 
bers as our estimate. For the Ising model we generate perfect samples from p{x) 
on a 100 X 100 lattice by first sam pling perfectly from the du al random cluster 
model by coupling from the past ( Propp and Wilson . lOOOl ). For the higher- 
order MRF model we are not able to generate perfect samples, so we instead 
run a long Gibbs sampling algorithm, and obtain (essentially) independent real¬ 
izations by sub-sampling this chain. The results for the Ising and higher-order 
MRF models are given in Figure [HI For 0 = 0.4 and 0.6 in the Ising model 
we see that we get very good approximations even for quite small values of z/. 
For 0 = 0.8 a large values for zz is needed to get high acceptance rate. For 
0 = — ln(-\/2 — 1) the acceptance rate ends at 30% for v = 18, and remembering 
that this is for a block update of 100 x 100 variables we think this is quite im¬ 
pressive. For the higher-order MRFs, the acceptance rate for Model 1 becomes 
very high for the higher values of zz, whereas the results for Model 2 resemble 
the results for 0 = — — 1) in the Ising model. 


7.3 Some possible applications 

In this section we present some simulation examples that demonstrate some 
possible applications of the proposed approximations and bounds. All the ex¬ 
amples are for MRFs defined on a rectangular lattice, but similar applications 
for MRFs defined on graphs are of course also possible. 


7.3.1 Maximum likelihood estimation 

Assume we have observed an image x which we suppose is a realization from an 
MRF p{x\0), where 0 is a scalar parameter. The p{x\0) may for example be the 
Ising model. If we want to estimate 0 the maximum likelihood estimator (MLE) 
is a natural alternative. As discussed in the introduction the computation of 
the MLE is complicated by the intractable normalising constant of the MRF. 
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Figure 9: Maximum likelihood example: The identihcation of an interval in 
which the MLE for 6 in an Ising model must be located. The estimation is based 
on a sample from the Ising model with 0 = 0.6. The plots shows computed upper 
and lower bounds for the log-likelihood function for different values of v. The 
horizontal dotted line is the maximum of the computed lower bounds and the 
two vertical dotted lines dehnes the interval in which the maximum likelihood 
estimate must be located. 

Figure [9] illustrates how upper and lower bounds for the normalising constant 
can be used to identify an interval in which the MLE for 9 must be located. To 
produce the curves in the hgure we hrst simulated an x from the Ising model 
with parameter 9 = 0.6, on an 100 x 100 rectangular lattice. Assume we want 
to hnd the MLE of 9 based on this x. We hrst computed upper and lower 
bounds for the log-likelihood function by replacing the intractable normalising 
constant with corresponding upper and lower bounds with u = 2 for 11 values 
of 0 on a mesh from 0 to 2. The maximum log-likelihood value must clearly 
be higher than the maximum of the lower bound values. Assuming the log- 
likelihood function to be concave we can then identify an interval in which the 
MLE of 9 must he. Dehning a new mesh of 11 0 values over this interval we 
repeated the process for v = A and obtain an improved interval, see the upper 
middle plot in Figure [9l We then repeated this process for z/ = 6,8,..., 18, and 
the hnal interval for the MLE shown in the lower right plot in Eigure[9]was 
(0.6055,0.6123). It is important not to confuse this interval with conhdence 
or credibility intervals, the interval computed here is just an interval which, 
with certainty, includes the MLE of 9. It should be noted that computations 
of the 2-11 = 22 bounds for the normalising constants can be done in parallel, 
and for larger values of v this is essential for the strategy to be practical. In 
corresponding simulation experiments with true 9 values equal to 0.4, 0.8 and 


20 















































Figure 10: Maximum posterior estimation example: True scene used in the 
maximum posterior estimation and fully Bayesian model examples. 


— ^(^2 — 1) we obtained the final intervals (0.4113,0.4115), (0.7716,0.8490) 
and (0.8234,0.9250), respectively. Our bounds are less tight for higher values of 
9 and naturally this gives longer intervals for the MLEs when the true 6 value 
is larger. 

7.3.2 Maximum posterior estimation 

The approximate Viterbi algorithm discussed in Section 15.2.11 can be used in 
image analysis applications. Suppose the constructed scene in Figure fTOT al. 
which is in an 89 x 85 lattice and which we denote by x, is an unobserved true 
scene, and assume that a corresponding noisy version y of x is observed. Here we 
will use a y generated from x by drawing each element of y independently from 
a normal distribution with unit standard deviation and with means 0 and 1 for 
the white and black areas of x, respectively. Assuming the likelihood parameters 
to be known, and assigning an MRF prior p(x) to x, we can estimate x from y 
by maximising the posterior distribution with respect to x, i.e. 

X = argmax {p(x)p(y|x)}. (31) 

X 

To compute x is computationally intractable, but by adopting the procedure 
discussed in Section [5. 2. II we obtain an approximation to x. We have done this 
for the six priors defined above and for values of v from 1 to 18. The scenes 
in Figure [11] show the results for v = 18. It is interesting to observe that the 
results for the Ising prior with /3 = 0.6 and 0.8 are very similar to the results 
for the higher-order interaction MRFs Models 1 and 2, respectively. 

Comparing the results for different values of v we find that for the Ising 
model with /3 = 0.4 the results are identical for all = 5 to 18. For the other 
seven priors the resulting scenes continued to vary slightly for all the values we 
used for v, but in all cases the differences became smaller for higher values of 
For example, for all these seven priors less than 0.6% of the nodes were assigned 
different values when z/ = 16 and z^ = 18 were used. When comparing the 
results with the true scene the Ising model with /3 = 0.8 has the lowest number 
of misclassifications, with the higher-order MRF Model 2 slightly behind. 
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Figure 11: Maximum posterior estimation example: Approximations to x when 
ly = 18 for the four Ising priors (upper row) and the two higher order MRFs 
(lower row). 

7.3.3 Fully Bayesian modelling 

Assume again that we have observed a noisy scene y corresponding to a un¬ 
observed true image x. An alternative to the strategy for estimating x from y 
discussed above is to adopt a fully Bayesian approach. Here we consider the 
same simulated y as we did in Section 17.3.21 Then we consider the unobserved 
X to be a sample from an MRF prior p{x\9), where 0 is a parameter vector. For 
p{x\9) we try both the Ising prior (j30p and the higher order MRF defined in 
Section [TTI The higher order MRF has the ten parameters specified in Figure [3l 
but to make the model identifiable we fix one parameter corresponding to each 
of the two maximal cliques. We set the potentials to zero for the configurations 
where all nodes in a maximal clique are equal. Thus, 9 gets eight elements in 
this prior. To fully specify the Bayesian model we need to adopt a parametric 
form for the likelihood p{y\x, p), where 99 is a parameter vector, and to specify 
priors for 9 and p. For the likelihood we assume the product of normals given 
in Section 17.3.21 and p contains a mean value and a standard deviation for each 
of the two possible values of the elements in x, i.e. (p = [po, pi,ao,ai]. To 
avoid problems if all elements of x are assigned to the same value we need to 
adopt proper priors for the likelihood parameters. For po and pi we use inde¬ 
pendent normal priors with zero mean and standard deviation ten, but to make 
the model identifiable we add the restriction pQ ^ pi. For (Tq , <ti ^ 0 we a priori 
assume independent exponential densities with means equal to ten. 

To simulate from the resulting posterior p{x, 9, p\y)a:p{9)p{ip)p{x\9)p{y\x, p) 
is computationally infeasible due to the normalising constant of p{x\9), but 
we can simulate from the approximation p{x,9,p\y)ccp{9)p{p)p{x\9)p{y\x,p), 
where p{x\9) is defined as in fl28)l and here we adopt the second POMM approx¬ 
imation variant discussed in the paragraph following (1281) . 

. To simulate from p{x,9,p\y) we adopt a Metropolis-Hastings algorithm 




-ln(v^ - 1) 
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Figure 12: Fully Bayesian model example: Results for a fully Bayesian model, 
using an Ising prior (upper row) and the higher-order interaction MRF prior 
(lower row). The leftmost column shows the maximum marginal posterior prob¬ 
ability estimate of the underlying scene. The two following columns show esti¬ 
mated posterior distributions for and respectively. The rightmost column 
shows estimated posterior distribution for 6 for the Ising prior and for the eighth 
parameter in Figure [3] for the higher-order interaction prior. 


and alternate between single site updates for the likelihood parameters and the 
for the components of x and a joint block update for the MRF parameters and 
X. To do Gibbs updates for the likelihood parameters is not feasible, but we use 
proposals distributions which are close to the full conditionals. More precisely, 
for each of /tq ^md //i we find the full conditionals when ignoring the restriction 
fiQ < fii and use these as proposals distributions, and for do and cxi we propose 
potential new values by proposing values for the corresponding variances from 
their full conditionals when a very vague inverse gamma prior is assumed. The 
proposals are then accepted of rejected according to the standard Metropolis- 
Bastings procedure. In practice essentially all proposals are accepted. In the 
single site updates for the components of x we generate the potential new state 
by changing the value of a randomly chosen element in x, and in the block 
update we use a proposal distribution 

q{e', x'\9, ip, x) = q{9'\0)p{x'\y, 6', p'), (32) 

where q{9'\0) is a random walk proposal for 9, and p{x'\y,9',p') is the POMM 
approximation of the posterior MRF p{x'\y,9',p')a:p(x\9')p{y\x',p'). In the 
random walk proposals for 9 we sample the elements independently from normal 
distributions centered at the current values and with all standard deviations 
equal to 0.1 both for the Ising and higher-order MRF prior cases. We define 
one MB iteration to consist of one update for each of the likelihood parameters, 
89 • 85 single site updates for elements of x and one block update as defined 
above. Using = 8 to define the POMM approximations, Figure W2\ summarises 
some simulation results. The left column shows the estimated true scene using 
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the marginal posterior probability estimator for each of the two priors. We 
observe that the higher-order MRF looks less noisy, and the fraction of wrongly 
classified values are 4.7% and 4.2% for the Ising and higher-order MRF priors, 
respectively. The corresponding means of the marginal posterior probabilities 
for the wrong class labels are 0.102 and 0.083, respectively. Thus, the posterior 
probability for the wrong class is on average 18.5% lower when using the higher- 
order MRF prior than with the Ising prior. The two middle columns in Figure fT^ 
show the estimated posterior distributions for /tq and fj-i and we observe that the 
true values, zero for the second column and one for the third, are far out in the 
tail of the posterior when using the Ising prior and more centrally located when 
using the higher-order MRF. All of this demonstrates the potential advantage 
of using a higher-order MRF prior. The last column in Figure [12] shows the 
estimated posterior distribution for 6 when using the Ising prior, and the for 
the eighth parameter in Figure [3] for the higher-order interaction prior. We 
observe that the variance in the posterior for 0 is quite small, but its value 
is also well above the phase transition limit. The posterior variability for the 
parameter in the higher-order interaction prior, however, is very large, and the 
same is true for the other parameters in this prior. This may indicate that the 
higher-order interaction prior is over-parameterised and that better results could 
perhaps have been obtained by adopting a prior with fewer parameters. The 
best alternative would perhaps be to put a prior also on the interaction order 
of the model, so that the complexity of the model could adapt automatically to 
the problem in focus. 


7.3.4 Perfect sampling by rejection sampling 

The last application of our approximation and bounds we discuss is how it can 
be used to construct a rejection sampling algorithm generating perfect samples 
from some MRFs p{x). Let p{x) = ^ exp{[/(x)} be a given binary MRF, and let 
p{x) denote the corresponding POMM approximation. One can then imagine 
a rejection sampling algorithm generating candidate samples from p{x) and 
accepting a candidate x with probability 


a{x) = k ■ 


P{x) 

p{x) 


~ exp{U{x)} 
pix) ’ 


(33) 


where A; is a constant so that a{x) ^ 1 for all x, and k = k/c. Clearly the 
optimal choice for k is 

fcopt = mjn j , (34) 

but to compute k^pt is computationally intractable. However, noting that In {p(x)e“^i^i} 
is a pseudo-Boolean function we can find a lower bound In Abound ^ In/Copt as 
discussed in Section 17.3.21 and use k = Abound in dM]) • R should be noted this 
procedure implies that we need to compute the normalising constant of the con¬ 
ditional distributions in ()28p for all values of the conditioning variables, so when 
defining the POMM approximation p{x) we must use the second approximation 
variant discussed in the paragraph following (I28p . 
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Figure 13: Real data example: U.S. liver and gallbladder cancer mortality map 
for white males during 1950-1959. Black and white squares denote counties 
with high and low cancer mortality rates, respectively. The outer box shows the 
boundary of the extended lattice. 


The acceptance rate of such a rejection sampling procedure will depend both 
on the quality of the approximation p{x) and the ratio febound/^opt ■ One should 
note that the goal in this setting is not to get a very high rejection sampling 
acceptance rate, but rather to find a good trade off between the acceptance rate 
and the computation time for generating the proposals, for example by Ending 
the smallest value of that give an acceptance rate above a given threshold. We 
have tried the procedure on the posterior distributions defined in Section [7.3.21 
For the posteriors based on Ising priors with /3 = 0.4, 0.6, 0.8 and — ln('\/2 — 1) 
we needed i/ = 5, 7, 9 and 11, respectively, to obtain acceptance rates above 
0.1, and to get acceptance rates above 0.5 the corresponding values for ly are 
6, 8, 11 and 13. Perfect samples from these four posterior distributions can 
of course alternativel y be g enerated by the coupling from the past procedure 
of ProDP and Wilson ( 1996l f . As the rejection sampling procedure requires an 
initiation step of establishing p{x) and computing Abound) coupling from the past 
is the most efficient alternative whenever only one or a few independent samples 
are required, but if a large number of samples are wanted the rejection sampling 
algorithm becomes the best alternative. We have tried the rejection sampling 
procedure also for the posteriors when adopting the higher order models de- 
hned in Section rrn as priors, but for this prior we were not able to get useful 
acceptance rates within reasonable computation times. 


7.4 Real data example 

In this section we c onsider a United States cancer mortality map compiled by 
Rjggan et al.l (|l987n . Figure [13] shows the mortality map for liver and gallblad¬ 
der cancers for white males in 1950-1959, where black and white squares denote 
counties with high and l ow cancer mort a lity r ates, respect i vely. The data set 
is previously analysed in jsherman et Zl (1200611 and iLiangl (I 2 OIOII , see also the 
discussion in iLiang et al.l (I 2 OIII ) . In these studies a free boundary autologistic 
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Figure 14: The real data example: The six possible configurations in a 2 x 2 
clique (up to rotation), and the corresponding values of the potential function 
in the 3x3 neighbourhood MRF. 

model is compared with a model where the values in the nodes are assumed to 
be independent, and the conclusions in both studies are that the spatial model 
gives a much better fit to the data than the independence model. 

We define a fully Bayesian model for the data set and a priori assign equal 
probabilities for three possible models. The two hrst models are the indepen¬ 
dence and autologistic models adopted in the previous studies, whereas the third 
model is an MRF with a 3 x 3 neighbourhood and higher-order interactions. To 
reduce the effect of the boundary assumption we include the observed nodes in a 
larger rectangular lattice as illustrated in Figure [131 a-nd adopt a free boundary 
assumption for the extended lattice. We let 2 denote a vector of the observed 
values and y a vector of the unobserved values, and let x = {y, z) be a vector of 
all the values in the extended lattice. The dimensions of the rectangular lattice 
is chosen so that every observed node is at least 10 nodes away from the border, 
and the extended lattice is then 78 x 87. Of the nodes in the extended lattice 
34.6% is observed. In the following we first give a more precise specification of 
the three possible models we allow on the extended lattice and thereafter define 
prior distributions for the parameters in these models. 

The independence model has only one parameter, which we denote by 6, 

and 


pi(x|0)ocexp [ 6 


(35) 


ieS 


In the autologistic model we assume a first-order neighbourhood and assume 
the horizontal and vertical interactions to be equal. The model then has two 
parameters, which we denote by 6q and 6i and set 6 = {6q,6i). Our MRF is 
then given by 


P2(a:|6')ocexp 1 6*0 ^ /(x* ^ xj) + 6*1 ^ 7(xi = 1 n Xj = 1) 


(36) 






In the 3x3 neighbourhood MRF the maximal cliques are 2x2 blocks of nodes. 
Also for this model we assume the potential function Va(xa) to be invariant 
under rotation of the values in xc- The rotational invariance restriction groups 
the 2^'^ = 16 possible configurations of xc into the six groups illustrated in 
Figure [m Without loss of generality we can set the potential value for the all 
zero configuration to zero, and we define one parameter for each of the other 
five configuration sets, again as shown in Figure [T4l The parameter vector in 
the model is thereby 9 = [6 ^,..., 0^) and the MRF is given as 


P3(x|6')ocexp j ^ Ua{xa]0) 

UeC 


(37) 
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where U\{x\-,9) is as specified in Figure [T^ 

As specified above we assume a prior probability p{m) = 1/3 for each model 
m = 1,2,3. Given any of the three models we let the prior for the associated 
parameters, Pmi9), be a Gaussian distribution where the components of 6 are 
independent and Gaussian distributed with zero mean and some variance cr^. 
A priori we do not expect very strong interactions for this type of data so we 
set = 1. The resulting posterior distribution of interest is p{m,6\z), but the 
unobserved vector y makes it hard to simulate from this distribution so in the 
simulation we focus on p{m, 9, y\z)azp{m)pmi9)pm{y, z\9). For m = 2 and 3 the 
MRF pm{y,z\9) of course contains a computationally intractable normalising 
constant, so as in Section [7.3.31 we replace Pm{y,z\9) with a corresponding ap¬ 
proximation pm{y,z\9) as defined in (f28l) . As in the fully Bayesian example in 
Section 17.3.31 we here use the second POMM approximation variant discussed 
in the paragraph following (I28p as this alternative is somewhat faster than the 
alternative POMM approximation. 

The distribution we want to sample is then 


p{m, 9, y\z)ccp{m)pm{9)pm{y, z\9), 


(38) 


and to s ample f rom t his distribution we adopt a reversible jump MGMG al¬ 
gorithm (jGreenl . Il995l ) . To get an acceptable trans-dimensional mixing in the 
MGMG algorithm we found it necessary to include more auxiliary variables. 
For each of the three models k = 1,2,3 we added a parameter 9^ and vector 
yk, where 9i e M, 02 £ 1^^ and 03 e and each of the y^’s are vectors of the 
same size as y. We assume {9k, yk) ~ Pk{9k)Pk{yk\9k,z) independently for each 
k and independent of {m,9,y), where Pk{yk\(^k, z) is the POMM approximation 
oi Pk{yk\9k, z)a:pk{yk, z\9k). To simulate from 


p(m,0,y, 01,^1,02,^2,03,^31^) =p("i>0>y|^) 


k=l 


we adopt the reversible jump setup with three types of proposals. The first 
proposal is a random walk proposal for each of 0, 0i, 02 and 03 . A separate 
update is performed for each of these four variables, so a proposed new value for 
one of them is accepted or rejected before a change for another of the vectors is 
proposed. The components in the potential new vector are generated indepen¬ 
dently from Gaussian distributions centered at the current value and with the 
same variance for all components. By trial and error we tuned the value of 
the proposal variance and ended up using r = 0.025. The second proposal type 
generate potential new values for each of y, yi, y 2 and y^. Again a proposal 
followed by an acceptance or rejection is done for each of the vectors separately. 
For yk,k = 1, 2,3 the potential new value is generated from Pfc(yfc|0fc, z), and the 
potential new value for y is generated from Pm{y\9m,z). One can note that the 
updates for yi, y 2 and y 3 are Gibbs updates, whereas the update for y is not. 
The acceptance rate for the y updates is, however, very close to one. The third 
update type is the only trans-dimensional update. Here new values are proposed 
for (m,0,y) and for one of (0i,yi), (02,y 2 ) and ( 03 ,y 3 ). First a new value m' 
is proposed for the model indicator m. If m = 1 or 3 we always set m' = 2, 
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Figure 15: Real data example: For the Markov chain simulation starting at 
m = 1, trace plot of the simulated values for m. 


and if m = 2 we set m' = 1 or 3 with probabilities a half for each. Thereafter 
potential new values for 9 and y is deterministically set as 6' = 9m> and y' = ym' ■ 
Finally potential new values for Om' and are sampled as 6'^, ~ N(0m')'r^d) 
and y'^i ~ Pm' ■, z)^ respectively. Here we use the same variance as 
in the random walk proposals discussed above, and I is the identity matrix of 
the suitable dimension. The proposed new values are then accepted or rejected 
according to the standard reversible jump MCMC acceptance probability. In 
particular it is easy to show that the Jacobian determinant occurring in the 
expression for the acceptance probability equals unity for this kind of proposal. 

To check the convergence and mixing properties of the reversible jump 
MCMC algorithm we ran three independent runs with different starting val¬ 
ues. For each of A: = 1,2 and 3 we had a chain where we initially set m = k and 
ran 200 iterations without the transdimensional move. The idea here is that the 
chains should essentially converge conditional on the value of m before we allow 
the value of m to vary. After these 200 initial iterations we ran the three chains 
for an additional 4300 iterations, now including the transdimensional move. The 
chain starting with m = 1 then quickly changed to having m = 2 and m = 3 
and never returned to the state m = 1. The chains starting with m = 2 and 
m = 3 never visited the state m = 1. All three chains had several jumps back 
and fourth between m = 2 and m = 3 as can be seen from Figure [TSl In the 
analysis of the Markov chain runs we discard the first 500 iterations of each run 
as burn-in, and use all three runs to estimate the quantities of interest. We 
estimate the posterior probability for model m = 3 by the fraction of times the 
chains are visiting this state, and get p{m = 3\z) = 0.680. To evaluate the 
uncertainty in this number we split the chains into intervals of 100 iterations 
and estimate p{m = 3|z) based on each of these intervals. These estimates are 
close to being uncorrelated so, considering them as independent, we use them 
to construct an approximate 95% credibility t-interval for p{m = 3|^;). The 
interval becomes [0.619,0.741], clearly showing that m = 3 is the model with 
highest a posteriori probability. This demonstrates that even for data sets where 
the interactions are quite weak there can be a need for including higher-order 
interactions. Figures [TBl and [T71 show estimates of the marginal distributions for 
the model parameters for model m = 2 and m = 3, respectively. One should 
note that the spread of all of these parameters are well within the variability of 
the corresponding standard normal prior, so our prior here should not be very 
influential on these results. We also observe that all parameters, both for model 
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Figure 16: Real data example: Histograms of the simulated posterior values for 
9q and 9i when m = 2. 
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Figure 17: Real data example: Histograms of the simulated posterior values for 
00, • • •, 04 when m = 3. 


29 

















































































m = 2 and 3, are significantly smaller than zero. This reflects that zero is the 
dominating value of the data set and that we in both models have chosen to set 
the potential for maximal cliques with only zero values to the reference value of 
zero. 


8 Closing remarks 


In this report we have shown how we can derive an approximate forward- 
backward algorithm by studying how to approximate the pseudo-Boolean energy 
function during the summation process. This approximation can then be used 
to work with statistical models such as MRFs. It allows us to produce ap¬ 
proximations and bounds of the normalising constant and likelihood for models 
that would normally be too computationally heavy to work with directly. It 
also gives POMM approximations of MRFs that can be used as a surrogate for 
MRFs in more complicated model setups. We have demonstrated the accuracy 
of the approximation and bounds through simple experiments with the Ising 
model and a higher-order interaction model, and demonstrated some potential 
applications by simulation experiments. We have also applied our approxima¬ 
tion to a real life data set and here demonstrated that higher-order interactions 
may be important even for data sets with weak interactions. We round off now 
with some possible future extensions as well as some closing rem arks. _ 


Th e approximation we have defined was inspired by the work in iTiehneland and Austad 
( 20121 ) and there are many parallels between the two. There the energy func¬ 
tion was represented as a binary polynomial and small interactions was dropped 
while running the forward-backward algorithm. This worked reasonably well, 
but one may worry that dropping many smalls interactions may produce an ap¬ 
proximation no better than when dropping one large i nteraction. Moreover, for 
model s with strong interactions the approximation in iTielmeland and Austad 
( 2012 ! ) would either include too many terms and thus explode in run-time, or if 
the cutoff level was set low enough to run the algorithm, exclude so many of the 
interactions that the approximation became uninteresting. In a sense the work 
in this report has been an effort to deal with these issues. We have a clearly 
defined approximation criterion and the construction of the algorithm allows for 
much more direct control over the run time. Also, by not just dropping small 
terms, but approximating the pseudo-Boolean function in such a way that we 
minimise the error sum of squares we manage to get better approximations of 
the models with stronger interactions. 

In our setting the sample space of the pseudo-Boolean function has a proba¬ 
bility measure on it, however in our discussion of approximating pseudo-Boolean 
functions we have considered each state in the sample space as equally impor¬ 
tant. Assuming we are interested in approximating the normalising constant, 
this is probably far from optimal and is reflected in our results. As we can see 
for the Ising model our approximation works better the smaller the interaction 
parameter 9. Our initial approximation attempts to spread the error of remov¬ 
ing interactions as evenly as possible among the states. Ideally, we would like 
to have small errors in states with a corresponding high probability, and larger 
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errors in states with low probability. To get this approximation we should re¬ 
place the SSE in (jT|) with a weighted error sum of s quares. Thi s prob l em ha s 


(I2008I . l2nidi . 


also been studied in the literature, see for instance iDing et al. 

However, unlike the unweighted case an explicit solution is not readily available 
for a probability density like the MRF. The iterative method of removing inter¬ 
actions does not work here, nor can we group the equations like we do with the 
SOIR approximation. 

In all our examples we consider MRFs defined on a lattice and assume the po¬ 
tential functions to be translational invariant. Our approximations and bounds 
are, however, also valid for MRFs defined on an arbitrary graph and applications 
of the approximation and bounds in such situations is something we want to 
explore in the future. 
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A Proof of Theorem [3] 

Expanding SSE(/, /) = Y^xen {/(^) ~ /(^)} get, 

2 {fix) - fix){ = Yj {fix) - fix) + fix) - /(x)j 

= Z! {fi^) ~ fi^)} + Z {fi^) ~ fi^)} + Z ifi^) ^ fix)}{fix) - fix)} 

xgQ, xgQ xgQ 

= SSE(/, /) + SSE(/, /) + 2 {fix) - fix)}fix) - 2 {fix) - fix)}hx). 

xgQ xgQ 

To prove the theorem it is thereby sufficient to show that, 

Z {fix) - fix)} fix) - 2 {fix) - fix)} fix) = 0. 

First recall that we from Q know that, 

2 {fix)-fix)] =0, VAe5. (40) 

Also, since S S, 

Z {fix)-fix)] =0, VAeJ. (41) 

xgQx 
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We study the first term, Yixenifi^) “/(^)}/(*)> expand the expression for /(x) 
outside the parenthesis and change the order of summation, 




xeQ 


xeQ 


{f{x) - f{x)} 

AeS keA 




AeS ' LfcsA 




s 

AeS 






= 0 , 


where the last transition follows from (jdOp . Using (j41j) we can correspondingly 
show that - f{x)}f{x) = 0. 


B Proof of Theorem [4] 

We study the error sum of squares, 

I 2 


2 {f(^) - /(®)} = 2 “ fix)}f{x) - 2 [{fix) - fix)}f{x) 


xefl 


xefl 


s 

xefl 


xefl 


2 ^^ifix) - fix)}Ylxk 

AeS keA 


11 / 3 * 1 ; {f(x) - /(!)) 

AeS L/ksOa 


S/5 

AeS 


s 

xefl 

A 


2 ^^{fi^) - fix)}Ylxk 

AeS keA 


2 ifi^) - fi^){ 


where the second sum is always zero by fl4T]l . Since S S, the first sum can be 
further split into two parts, 



2 {/(^) -/(^)} 


2 ifi^) - fi^){ 

+ S z’" 

2 {fi^) - fi^){ 

AeS 


AeS 


AeS\S 



where once again the first sum is zero. 


C Proof of Theorem [5] 

From Theorem [T] it follows that it is sufficient to consider a function fix) with 
non-zero interactions /3^ only for A e /Sjij}, since we only need to focus on the 
interactions we want to remove. Thus we have 

fix)= 2 fix) = 

AeS^ijy keA keA 
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and we need to show that then 




r 


< 


l^AuM 

0 


if A u {i,j} e S', 
if A u {i} e S and A u {j} ^ S, 
if A u {i} ^ S and A u {j} e S, 
otherwise. 


We start by defining the sets 


(42) 


i?A = {A\{i}, A\{j}, A\{z,j}} for A e 

and note that these sets are disjoint, and, since we have assumed S to be dense, 
^ S. Defining also the residue set 


T = S\ 



we may write the approximation error f{x) — f{x) in the following form, 


f{x)-f{x) = Yi 



2 n n 

XeRa keA\X I keA\{i,j} 


AsT keA 


Defining 


Af\xi,Xj) = P^XiXj- 2 n 

XeRa fcsA\A 

we have 

f{x)-f{x)= ^ Af^{xi,Xj) 

AeS{ij} kEA\{i,j} AeT fcsA 

Inserting this into dS]) and switching the order of summation we get 

Y {/(^)-/(^)}= S ( Z! 

xeQx xeUx \ As5{i^j} kEA\{i,j} AeT fesA / 


= z (z ^f^ix^,xj) [] ^A:) -z fz 

As5{i^j} yxeQA kEA\{i,j} J AeT XxeQx AcsA / 


= Z f Z Z f Z 

\*S^Au(A\{i,j} / AsT \xEflxuT / 


0 for all A e S 


We now proceed to show that this system of equations has a solution where (3^ = 
0 for A 6 T and XjxsO;, (a\{- } = 0 each A 6 5'pj}. Obviously for 

each A the function Af^{xi, Xj) has only our possible values, namely A/^(0,0), 
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A/^(1,0), A^(0,1) and A/^(l,l). Thus the sum ^fHxi,Xj) is 

simply given as a sum over these four values multiplied by the number of times 
they occur. Consider first the case where A, and thereby also Au (A\{f, j}) does 
not contain i or j. Then the four values A/^(0,0), A/^(1,0), A^(0,1) and 
A/^(l, 1) will occur the same number of times, so 

2 (A/^(0,0) + A/^(l, 0) + A/^(0,1) + A/^(l, 1)) . 

Next consider the case when A, and thereby also A u (A\{i,j}), contains z, 
but not j. Then Xj = 1 in all terms in the sum, so the values A/^(0,0) and 
Af^{0, 1) will not occur, whereas the values A/^(l, 0) and A/^(l, 1) will occur 
the same number of times. Thus, 

2 ^f\xi,x,) = ^hid(Aihd{A/yi,o) + A/^(i,i)}. 

j}) 

When A contains j, but not i we correspondingly get 

2 A/^(x„X,) = {A/^(0,1) + A/^(l, 1)} . 

^GfJAu(A\{i.j}) 

The final case, that A contains both i and j, will never occur since X e S and 
all interaction involving both i and j have been removed from S. We can now 
reach the conclusion that if we can find a solution for 

A/^(0,0) + A/^(l, 0) + A/^(0,1) + A(l, 1) = 0, 

A/^(l,0)+A/yi,l) =0, 

A/yo,l) + A/yi,l) =0, 

for all A e «S'{ij} we also have a solution for (I44p as discussed above. Using our 
expression for A/^(xj,Xj), the above three equations become 

= 0 , 

= 0 , 

/ 3 ^ - = 0 . 

Since the sets R\ are disjoint, the three equations above can be solved separately 
for each A, and the solution is = —A/jA and = /jAMil = ^/3^. 

Together with = 0 for A e T this is equivalent to (|10p in the theorem. 

Inserting the values we have found for /3^ in ()43l) we get 

Af^{xi,Xj) = (^XiXj + \-\xi- ^Xj 

Inserting this into the above expression for /(x) — /(x), and using that we know 
= 0 for A e T we get (llip given in the theorem. 
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